Pricing

Pay per usage

RAG Text Chunker — heading & sentence aware, Japanese ready

Split Markdown or plain text into retrieval-ready chunks for RAG pipelines: cuts at headings, packs whole sentences up to a size limit with optional overlap, and tags every chunk with its heading breadcrumb. Handles Japanese sentence boundaries. No LLM cost.

Pricing

Pay per usage

Rating

0.0

(0)

Developer

Shinobu Otani

Actor stats

Bookmarked

Total users

Monthly active users

3 days ago

Last modified

RAG Text Chunker

Split Markdown or plain text into retrieval-ready chunks. Heading-aware, sentence-aware, Japanese-ready — deterministic, no LLM cost.

Cuts at headings first: chunks never mix sections; fenced code blocks are not mistaken for headings
Packs whole sentences up to max_chars; oversized sentences are hard-split as a last resort
Optional overlap between consecutive chunks for retrieval continuity
Japanese-aware boundaries: 。！？ with closing-quote handling alongside Latin .!? (decimals like 3.14 stay intact)
Heading breadcrumbs: every chunk carries heading_path for citation

Input

{"documents": ["# 概要\n\n検証は三段階で行う。まず再現する。"], "max_chars": 1500, "overlap": 200}

Output (one dataset item per chunk)

{"id": 0, "document_index": 0, "heading_path": ["概要"], "text": "検証は三段階で行う。 まず再現する。", "char_count": 19}

Typical uses: chunking docs/knowledge bases before embedding; Japanese or mixed-language corpora for vector search; reproducible chunk boundaries.

Text Splitter & Chunker for RAG / LLMs

zenomastro/text-splitter-for-llm

Split text into clean, overlapping chunks ready for embeddings, vector databases, RAG and LLM context. Configurable size, overlap, and split strategy.

Rosario Vitale

RAG-Ready Markdown Converter & Chunker

foxpink/apify-rag-markdown-chunker

Convert raw HTML/text into clean Markdown and split into ready-to-ingest chunks for RAG pipelines, Vector DBs, and LLM fine-tuning workflows.

Nguyễn Anh Duy

4.7

Web-to-Markdown Generator for AI & RAG Pipelines

profitstack/web-to-markdown-generator-for-ai-rag-pipelines

Convert any website into clean, heading-based chunking, LLM-ready Markdown for RAG and AI agents.

Manas Mantri

Docs-to-RAG Crawler

automation-lab/docs-rag-crawler

Crawl documentation sites (ReadTheDocs, GitBook, Docusaurus, Mintlify) into RAG-ready Markdown/JSON chunks with stable chunk IDs, heading breadcrumbs, word counts, and token estimates.

Stas Persiianenko

Website to Markdown Crawler for LLM & RAG

logiover/website-text-markdown-crawler

Crawl any website to clean Markdown and plain text for LLM training and RAG. HTML to Markdown, no API or login. Export website text to CSV or JSON.

Logiover

Markdown RAG Chunker

codepoetry/markdown-rag-chunker

Chunk any document for RAG — PDF, HTML, Word, Excel, PPTX, Markdown and more. Header-aware splits with token counts and stable IDs.

CodePoetry

Website Content Pipeline for AI: Markdown, Tokens, RAG Chunks

scrapemint/website-content-crawler

Crawl any website and ship clean Markdown, plain text, and HTML for AI, LLM, and RAG pipelines. Each row carries token estimates, JSON LD metadata, link graph, and optional auto chunk splitting for vector databases. Pay per page.

Ken M

Rag Content Chunker

labrat011/rag-content-chunker

Turn raw text, Markdown, or Apify datasets into token-perfect RAG chunks with deterministic IDs, source metadata, and a billing-ready summary—ready for embeddings or vector DBs without extra glue code.

mick_

Website to Text & Markdown — AI / RAG Content Crawler

inexhaustible_glass/rag-website-crawler

Scrape any website into clean text & Markdown with RAG-ready chunks and token counts for LLMs, vector databases (Pinecone, Qdrant) and AI chatbots. Also extracts linked PDF/Word/Excel. Anti-block, robots.txt-aware. Website-to-text for beginners, full RAG pipeline for pros. CPU only, no API key.

Hitman studio

Japanese Text Normalizer — NFKC, kana, whitespace, sentences

shoebill-dev27/jp-text-normalizer

Normalize Japanese text for data pipelines: Unicode NFKC (full/half-width unification), wave-dash unification, whitespace cleanup, hiragana/katakana conversion, Japanese-aware sentence splitting, and per-script character stats.